library(tidyverse)
data <- read.csv("data.csv")

Data Description

This is the Mice Protein Expression dataset from the UCI Machine Learning Repository. It consists of the expression levels of 77 proteins or protein modifications that produced detectable signals in the nuclear fraction of the cortex. The dataset includes measurements from 38 control mice and 34 trisomic (Down syndrome) mice.

data

What the columns represent

Mouse ID: Unique identifier for each mouse. Fifteen measurements of each protein were taken per mouse.

Expression Levels: The dataset contains values for the expression levels of 77 proteins.

Genotype: The genotype of each mouse, either control (c) or trisomy (t).

Treatment Type: The treatment administered to the mouse, either memantine (m) or saline (s).

Behavior: Describes the behavioral context of each mouse, either context-shock (CS) or shock-context (SC).

Classes

The mice are classified into eight groups based on three features: genotype, behavior, and treatment. Genotype is either control or trisomic. In terms of behavior, some mice were stimulated to learn (context-shock), while others were not (shock-context). Finally, to assess the effect of the drug memantine on the ability of trisomic mice to learn, some mice were injected with the drug and others with saline as a control.
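
The eight classes are the full 2 x 2 x 2 crossing of these three features. As a minimal sketch, the class labels can be rebuilt in base R. The genotype-behavior-treatment layout is read off labels such as c-CS-m used in this report, and the name `design` is only illustrative:

```r
# Enumerate every genotype x behavior x treatment combination.
# Codes follow the column descriptions: c/t, CS/SC, m/s.
design <- expand.grid(
  genotype  = c("c", "t"),
  behavior  = c("CS", "SC"),
  treatment = c("m", "s"),
  stringsAsFactors = FALSE
)

# Build labels in the genotype-behavior-treatment layout, e.g. "c-CS-m".
design$class <- paste(design$genotype, design$behavior, design$treatment, sep = "-")

nrow(design)        # 8 classes in total
sort(design$class)
```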

ggplot(data, aes(class)) +
  geom_bar(aes(fill = class), alpha = 0.8) +
  scale_fill_grey(start = 0.8, end = 0.2) +
  labs(title = "Number of Observations per Class", y = "Count", x = "Class")

  • c-CS-s: Control mice, stimulated to learn, injected with saline (9 mice)

  • c-CS-m: Control mice, stimulated to learn, injected with memantine (10 mice)

  • c-SC-s: Control mice, not stimulated to learn, injected with saline (9 mice)

  • c-SC-m: Control mice, not stimulated to learn, injected with memantine (10 mice)

  • t-CS-s: Trisomic mice, stimulated to learn, injected with saline (7 mice)

  • t-CS-m: Trisomic mice, stimulated to learn, injected with memantine (9 mice)

  • t-SC-s: Trisomic mice, not stimulated to learn, injected with saline (9 mice)

  • t-SC-m: Trisomic mice, not stimulated to learn, injected with memantine (9 mice)
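
As a quick consistency check on the counts listed above, the sketch below only re-adds the numbers from this list and uses the 15 measurements per mouse stated in the column descriptions. The vector name `mice_per_class` is illustrative:

```r
# Mice per class, copied from the list above.
mice_per_class <- c(
  "c-CS-s" = 9, "c-CS-m" = 10, "c-SC-s" = 9, "c-SC-m" = 10,
  "t-CS-s" = 7, "t-CS-m" = 9,  "t-SC-s" = 9, "t-SC-m" = 9
)

# Totals by genotype: labels starting with "c" are control, "t" are trisomic.
is_control <- startsWith(names(mice_per_class), "c")
sum(mice_per_class[is_control])    # 38 control mice
sum(mice_per_class[!is_control])   # 34 trisomic mice
sum(mice_per_class)                # 72 mice overall

# If every mouse contributes 15 measurements, the expected number of rows.
sum(mice_per_class) * 15           # 1080
```

The totals agree with the 38 control and 34 trisomic mice given in the data description.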

Data Cleaning and Handling Missing Values

The dataset contains some missing values. We handle them by replacing each missing value with the mean of the same protein within the same class.

missing_values <- data %>%
  setNames(gsub("_N", "", names(.))) %>%
  summarise(across(everything(), ~ sum(is.na(.)), .names = "{.col}")) %>%
  pivot_longer(everything(), names_to = "key", values_to = "missing") %>%
  mutate(
    total = nrow(data),
    pct = (missing / total) * 100,
    isna = missing > 0
  ) %>%
  filter(isna) %>%
  arrange(desc(pct))

percentage.plot <- missing_values %>%
  ggplot(aes(x = reorder(key, pct), y = pct)) +
  geom_bar(stat = 'identity', fill = 'darkred', alpha = 1) +
  scale_x_discrete(limits = missing_values$key) +
  coord_flip() +
  labs(title = "Missing Values In the Dataset", x = 'Protein Name', y = "Percentage of Missing Values") +
  theme(plot.title = element_text(hjust = 0.5)) + 
  guides(fill = "none")
percentage.plot

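# Replace each missing value with the mean of that protein within the same class.
# Only numeric columns are imputed; identifier and label columns are left as-is.
data <- data %>%
  group_by(class) %>%
  mutate(across(where(is.numeric), ~ replace(.x, is.na(.x), mean(.x, na.rm = TRUE)))) %>%
  ungroup() %>%
  as.data.frame()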

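# Strip the "_N" suffix from the protein column names.
names(data) <- sub("_N$", "", names(data))
# Columns 2 to 78 hold the 77 protein expression levels.
proteins <- names(data)[2:78]
classes <- unique(as.character(data$class))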
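
To see what class-wise mean imputation does, here is a minimal sketch on a made-up data frame. The toy values are invented for illustration and are not taken from data.csv. Each NA is replaced by the mean of the non-missing values in the same class:

```r
library(dplyr)

# Toy data: two classes, one protein column with a missing value in each class.
toy <- data.frame(
  class = c("c-CS-s", "c-CS-s", "c-CS-s", "t-CS-s", "t-CS-s", "t-CS-s"),
  BDNF  = c(1, NA, 3, 10, 20, NA)
)

# Impute within each class: NA -> mean of the observed values of that class.
toy_imputed <- toy %>%
  group_by(class) %>%
  mutate(across(where(is.numeric), ~ replace(.x, is.na(.x), mean(.x, na.rm = TRUE)))) %>%
  ungroup() %>%
  as.data.frame()

toy_imputed$BDNF   # 1 2 3 10 20 15
```

If a protein were missing for every row of a class, the class mean would be NaN because there is nothing to average, so such a case would need a different fallback.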

Data Visualization

Histograms of expression levels

# Draw one histogram per protein, each under its own tab heading.
for (protein in proteins) {
  cat("####", protein, "{.tabset .tabset-pills}\n")
  # .data[[protein]] looks the column up by name without eval(parse()).
  plot <- ggplot(data, aes(x = .data[[protein]])) +
    geom_histogram(aes(fill = after_stat(count)), bins = 30, color = "black", alpha = 0.5) +
    scale_fill_gradient("Count", low = "blue", high = "red") +
    labs(title = protein,
         x = "Expression Level",
         y = "Count") +
    theme_minimal()
  print(plot)
  cat("\n\n")
}
[Output: one histogram per protein, with expression level on the x axis and count on the y axis, each drawn with the default of 30 bins. Panels appear in this order: DYRK1A, ITSN1, BDNF, NR1, NR2A, pAKT, pBRAF, pCAMKII, pCREB, pELK, pERK, pJNK, PKCA, pMEK, pNR1, pNR2A, pNR2B, pPKCAB, pRSK, AKT, BRAF, CAMKII, CREB, ELK, ERK, GSK3B, JNK, MEK, TRKA, RSK, APP, Bcatenin, SOD1, MTOR, P38, pMTOR, DSCR1, AMPKA, NR2B, pNUMB, RAPTOR, TIAM1, pP70S6, NUMB, P70S6, pGSK3B, pPKCG, CDK5, S6, ADARB1, AcetylH3K9, RRP1, BAX, ARC, ERBB4, nNOS, Tau, GFAP, GluR3, GluR4, IL1B, P3525, pCASP9, PSD95, SNCA, Ubiquitin, pGSK3B_Tyr216, SHH, BAD, BCL2, pS6, pCFOS, SYP, H3AcK18, EGR1, H3MeK4, CaNA.]
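
By default `geom_histogram()` uses 30 bins and suggests picking a better value with `binwidth`. One common choice is the Freedman-Diaconis rule, width = 2 * IQR / n^(1/3). The helper below is a hypothetical sketch: the name `fd_binwidth` is not part of ggplot2 or of this analysis, and the example values are simulated rather than taken from the dataset:

```r
# Freedman-Diaconis bin width: 2 * IQR(x) / n^(1/3), ignoring missing values.
fd_binwidth <- function(x) {
  x <- x[!is.na(x)]
  2 * IQR(x) / length(x)^(1 / 3)
}

# Example on simulated values standing in for one protein's expression levels.
set.seed(1)
expression <- rnorm(1000, mean = 0.5, sd = 0.1)
fd_binwidth(expression)
```

The result could then be passed on per protein, for example `geom_histogram(binwidth = fd_binwidth(data$BDNF))`. Whether this gives more readable plots than 30 fixed bins depends on the spread of each protein.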